CPSC 545/445 (Autumn 2003) - Class 6: Pairwise Sequence Alignment Module 2, Part 2 --- 2.4 psa with affine gap scores reminder: affine gap scores: gamma(g) = -d - (g-1)*e here: global psa (loccal psa is analogous) general approach: modify take the standard global psa (needleman-wunsch) algorithm as follows: M(i,j) := max { M(i-1,j-1) + s(xi ,yj) M(k,j) + gamma(i-k) k := 0...i-1 M(i,k) + gamma(j-k) k := 0...j-1 } problem: additional iteration over k (= position of last non-gap character) -> O(m*n*max{m,n}) time; in practice, this is typically too slow but note: this works for arbitrary gap scores better: exploit additive structure of affine gap score (this will alow us to achieve O(m*n) time complexity) idea: use a help structure to capture information on start of gaps and accumulated gap penalties -> three cases are pursued simultaneously: "no gap" - best score M(i,j), "(extending) gap in y" = "insertion in x" - best score Ix(i,j) "(extending) gap in x" = "insertion in y" - best scost Iy(i,j) M(i,j) := max { M(i-1, j-1) + s(xi,yj) Ix(i-1, j-1) + s(xi,yj) % end gap in y Iy(i-1, j-1) + s(xi,yj) % end gap in x } Ix(i,j) := max { M(i-1,j) -d % start new gap in y Ix(i-1,j) -e % extend existing gap in y } Iy(i,j) := max { M(i, j-1) -d % start new gap in x Iy(i, j-1) -e % extend existing gap in x } initialization: M(k,0) = Ix(0,k) = Iy(k,0) = M(0,k) = -d-k*e this is based on the assumption that deletion is never directly followed by insertion (true for optimal alignment if -d-e < min s(a,b) for any a,b --- 2.5 other sequence alignment problems: - local alignments with repeated matches useful for finding multiple genes (e.g., tRNA genes) - alignments with overlap matches special case of local alignment can be solves using smith-waterman algorithm, but there is a better algorithm (simple variation of needleman-wunsch) - alignment with hybrid(complex) match conditions these can all be solved using variants of dynamic programming, time/space complexity O(m*n) -> BSA, Ch.2; reading assignment - heuristic alignment -> 2.6 - multiple sequence alignment -> 2.7 --- 2.6 Heuristic alignment idea: reduce complexity of psa algorithms by using heuristic method -> loose guarantee for finding optimal alignment (w.r.t. given scoring model) [why may this not be a big problem in practice?] Here: BLAST (Basic Local Alignment Search Tool; Altschul et al., 1990) goal: find high-scoring local alignment between query sequence y (e.g., specific protein sequence) and long target sequence x / database (e.g., genome, chromosome) basic idea: - a true match alignment likely contains very similar or identical subsequences - find such subsequences and use them as seeds from which to extend longer, weaker but still good alignment - short seeds allows use of table of all possible seeds with starting points (index) constructed from query during preprocessing realisation in BLAST: - parameters w = seed length; t, C = score thresholds (seed / extension) (prot: w=3, dna: w=11) - find all length w substrings of query y that align to a particular substring of sequence x with score > t -> seed table S - scan through sequence x (or database), for each match of subsequence in x with seed in S ("hit"): extend ungapped local alignment in both directions until alignment score drops below best score for shorter extensions - C (implementation uses additional heuristics, special techniques to increase efficiency) different versions for BLAST for DNA and protein alignments; various improvements of basic algorithm, including: - gapped BLAST (for gapped alignment - uses dynamic programming to extend hits), - PSI-BLAST (position-specific iterated BLAST, often detects weaker seq similarities) other heuristic alignment algorithms: FASTA (see BSA, 2.5) --- Resources: BSA, Ch.2 [good introduction; contains further references]